Abstract

Recently, the development of biobanks linked to electronic medical records has presented new opportunities for genetic 
and epidemiological research. Studies based on these resources, however, present unique challenges, including the 
accurate assignment of individual-level population ancestry. In this work we examine the accuracy of administratively- 
assigned race in diverse populations by comparing assigned races to genetically-defined ancestry estimates. Using 220 
ancestry informative markers, we generated principal components for patients in our dataset, which were used to cluster 
patients into groups based on genetic ancestry. Consistent with other studies, we find a strong overall agreement (Kappa 
= 0.872) between genetic ancestry and assigned race, with higher rates of agreement for African-descent and European- 
descent assignments, and reduced agreement for Hispanic, East Asian-descent, and South Asian-descent assignments. 
These results suggest caution when selecting study samples of non-African and non-European backgrounds when 
administratively-assigned race from biobanks is used.

Introduction

Hospital-based biobanks linked to electronic medical records 
(EMRs) are a growing and cost-effective way to ascertain large 
segments of a population for biomedical research studies. Genetic 
and clinical studies increasingly require larger numbers of samples 
to provide statistical power to discover genetic variation associated 
with complex human diseases; using existing surveyed clinical 
populations is a way to meet this demand quickly. Multiple studies 
have been published illustrating the basic utility of biobanks for 
validating existing association studies [1], performing phenome- 
wide association studies [2,3], and for identifying novel genetic 
associations within existing genotype-phenotype databases [4]. 
The use of EMR-based biobanks for research purposes is expected
to grow in the coming years [5,6].

The Vanderbilt DNA biobank (BioVU) contains nearly 160,000 
DNA samples linked to electronic medical records at Vanderbilt 
University and continues to accrue additional patient samples. 
DNA is extracted from discarded blood samples collected during 
routine patient care. EMR data is drawn from administrative 
databases and scrubbed of identifying information to generate a 
resource for researchers known as the Synthetic Derivative (SD) 
[1,7]. A subset of the SD population has linked DNA samples, 
forming the BioVU subset. Upon institutional approval of a 
BioVU project, samples with the phenotype of interest, based on 
data from the SD, can be accessed and genotyped. All genotype 
data generated using BioVU samples is then made available to
Vanderbilt investigators for future studies. The BioVU design has
the distinct advantage of rapid sample accrual for a variety of 
clinical traits present in the patient population; however, re- 
contacting participants for sample collection or validation of 
subject data is prohibited by both institutional policy and the de- 
identification process, limiting some applications of the data. 

With increased emphasis on the use of DNA biobanks, it is 
important to note the critical role of race in genetic association 
studies. A sample drawn from multiple underlying populations is 
subject to population stratification, where each population has a 
slightly different genetic architecture. If not properly accounted 
for, these differences in allele frequency can result in false 
associations. As such, it is common practice in genetic studies to 
correct for underlying population sub-structure by estimating 
global genetic ancestry for each sample [8]. This is often 
accomplished by genotyping a set of ancestry informative markers 
(AIMs) which are evaluated using either principal components 
analysis (for a continuous estimate of ancestry group) [9] or cluster 
analysis (for a categorical ancestry assignment) [10]. The
individual measure of genetic ancestry is then used to stratify
individuals or is included as a covariate for adjustment in
statistical analyses to avoid confounding.
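
The confounding described above can be illustrated with a small worked example. The sketch below uses hypothetical counts (not data from this study) for two populations in which carrying an allele is unrelated to disease within each population, but allele frequency and disease prevalence both differ between populations. The function names `odds_ratio` and `pool` are illustrative only.

```python
def odds_ratio(table):
    """Odds ratio for a 2x2 table given as
    (case_carriers, case_noncarriers, control_carriers, control_noncarriers)."""
    a, b, c, d = table
    return (a * d) / (b * c)


def pool(*tables):
    """Add 2x2 tables cell by cell, ignoring which population each came from."""
    return tuple(sum(cells) for cells in zip(*tables))


# Hypothetical population 1: carrier frequency 20%, disease prevalence 10%.
pop1 = (20, 80, 180, 720)
# Hypothetical population 2: carrier frequency 80%, disease prevalence 30%.
pop2 = (240, 60, 560, 140)

or_pop1 = odds_ratio(pop1)                 # no association within population 1
or_pop2 = odds_ratio(pop2)                 # no association within population 2
or_pooled = odds_ratio(pool(pop1, pop2))   # apparent association after pooling

print("within population 1:", or_pop1)
print("within population 2:", or_pop2)
print("pooled, ancestry ignored:", round(or_pooled, 2))
```

Within each population the odds ratio is exactly 1, yet the pooled table yields an odds ratio above 2. Stratifying by ancestry, or adjusting for it as a covariate, is what removes this spurious signal.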

In lieu of genotyping AIMs, genetic studies sometimes use self- 
reported race as a covariate, either as a surrogate for genetic 
ancestry or to capture social and demographic components [11]. 
The complex nature of the relationship between race and genetic
ancestry has been extensively explored [12], and multiple studies
have shown that self-reported race is generally reflective of an 
individual's genetic ancestry but does not account for population 
substructure [13,14]. While self-reported race is commonly 
collected in epidemiologic cohorts, many provider-based studies 
use third-party reported race rather than self-reported race. 
Studies of agreement between self and third-party race assignment 
have been conducted, but have conflicting results, showing varying 
levels of agreement [13-16].

Dumitrescu et al. [17] previously reported on the utility of using 
third-party reported race for African-descent and European- 
descent individuals within BioVU, citing a high concordance with 
genetic ancestry. However, third-party assignment of these racial 
categories may be influenced by subjective criteria for specific 
racial groups. This notion is supported by a study that reported 
high accuracy for distinguishing African American and European 
American individuals (positive predictive value 0.95 & 0.94, 
respectively) using third-party reporting, but less accuracy for 
Hispanics and American Indians (positive predictive value 0.81 & 
0.50, respectively) [18]. 

The accuracy of third-party racial assignments is especially 
critical for biobank-based studies. Should an investigator seek to 
perform a genetic study within a diverse population, sample 
selection is likely dependent on the third-party racial assignment 
within the EMR. As a result, samples of a different ethnicity may 
be selected and genotyped, only to be excluded from analysis after 
ancestry is determined using genetic data, resulting in a waste of 
research funds. Additionally, genetic ancestry can influence some 
clinical decision-making processes, including automated decision 
support, which is being integrated into some EMRs [19,20]. 
Before decision support rules are implemented that consider race 
in treatment decisions, it is important to characterize the accuracy 
of race within EMRs. In this work, we characterize how well 
administrative third-party race assignment within BioVU reflects 
ancestry estimated from genetic data. 

Methods 

Ethics Statement 

BioVU, Vanderbilt University's biobank, uses de-identified patient 
electronic medical records. This study is considered non-human 
subjects research by the Vanderbilt institutional review board. 

Sample Selection 

A total of 7,252 individuals were selected from BioVU, 
specifically to over-represent diverse populations and individuals 
with "unknown" administrative race assignments. Within the 
synthetic derivative (SD) and BioVU, race is administratively 
assigned to one of eight predefined categories: White (W), Black 
(B), Asian/Pacific (A), Native American (N), Indian (I), Hispanic 
(H), other (O), or unknown (U) (Table 1). Based on communica- 
tions with clinical personnel who regularly assign race codes, in 
practice, the Native American (American Indian) and Indian 
(South Asian) race codes are sometimes incorrectly used 
interchangeably. No individuals with "other" ethnicity were 
selected in this study. For this paper, we will refer to the 
predefined, administratively-assigned racial categories as Cauca- 
sian, African American, Asian/Pacific, Native American, Indian, 
and Hispanic (Table S1).

Genotyping 

All 7,252 BioVU samples were genotyped using the Illumina 
VeraCode GoldenGate assay in the Center for Human Genetics 
Research (CHGR) DNA Resources Core at Vanderbilt University
for 308 ancestry informative markers (AIMs) and scanned on the
Illumina BeadXpress reader. AIMs genotypes were merged with
existing data for 805 individuals from the International HapMap
Project (Phase 3, Revision 3, Build 36), including 165 CEU, 203
YRI, 137 CHB, 113 JPT, 101 GIH, and 86 MXL samples, as
reference populations to assist in determining genetic ancestry
(Table S1). The genetic data underwent quality control measures,
including removal of 39 non-autosomal SNPs, 38 SNPs not also in
the HapMap dataset, and 11 SNPs that were co-linear with
principal component (PC) three and caused atypical clustering,
leaving 220 SNPs for analysis (SNP list available upon request). 
Within the final merged dataset of 220 SNPs for 8,057 individuals, 
all SNPs had a minor allele frequency (MAF) greater than five 
percent. Of the BioVU samples in our dataset, 52% (4,192) were 
female. 
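
As a hedged illustration of the quality control arithmetic and the minor allele frequency (MAF) criterion described above, the sketch below checks the SNP and sample counts reported in the text and computes MAF for a single SNP. The 0/1/2 genotype coding, the use of `None` for missing calls, the function names, and the example genotypes are assumptions made for illustration; they are not taken from the study pipeline.

```python
def minor_allele_frequency(genotypes):
    """MAF for one SNP. Genotypes are coded as 0, 1, or 2 copies of a
    chosen allele; None marks a missing call (assumed coding)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this SNP")
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def passes_maf_filter(genotypes, threshold=0.05):
    """True when the SNP's MAF is strictly greater than the threshold."""
    return minor_allele_frequency(genotypes) > threshold


# Counts reported in the text.
snps_after_qc = 308 - 39 - 38 - 11          # SNPs remaining for analysis
hapmap_samples = 165 + 203 + 137 + 113 + 101 + 86
merged_samples = 7252 + hapmap_samples      # BioVU plus HapMap individuals

# Hypothetical genotype vectors for two SNPs.
common_snp = [0, 0, 1, 2, None, 1]          # MAF 0.4
rare_snp = [0] * 19 + [1]                   # MAF 0.025

print(snps_after_qc, hapmap_samples, merged_samples)
print(passes_maf_filter(common_snp), passes_maf_filter(rare_snp))
```

The reported counts are internally consistent: 220 SNPs remain after the three removal steps, and the 805 HapMap samples plus 7,252 BioVU samples give the 8,057 individuals in the merged dataset.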

Genetic Ancestry Assignment 

We performed principal components analysis (PCA) for 220 
SNPs using the EIGENSTRAT package [9] on the combined 
samples. Outlier removal was disabled for all EIGENSTRAT 
analyses. Consistent with published studies [9], we generated the top 
ten principal components to estimate genetic ancestry based on 
genetic sharing of SNPs with HapMap samples of known 
continental origin. To assign genetic ancestry for each individual 
we performed model-based clustering, using the mclust [21] R 
package, to define and assign individuals to clusters using an 
ellipsoidal model with varying volume, shape, and orientation. We 
indicated that mclust should define five clusters in order to 
differentiate the five ancestry groups known to be present in the 
dataset (European-descent, African-descent, East Asian-descent, 
South Asian-descent, and Hispanic-descent). By plotting a 10 by 10 
matrix of all pairs of PCs, colored by the defined clusters, we visually 
determined that PCs 1, 2, 3, 7, 9, and 10 optimally captured 
separation of the five clusters. These six PCs were used to perform 
clustering. Genetic variance within the European-descent cluster 
was captured in the unused principal components, and may reflect a 
bias toward European-descent components within this set of AIMs. 
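
To make the principal components step concrete, the following is a minimal sketch of PCA on a genotype matrix. It is not the EIGENSTRAT implementation. It assumes genotypes coded as 0/1/2 allele counts and the commonly described normalization of centering each SNP and scaling by sqrt(p(1 - p)), and it extracts only the first principal component by power iteration. The toy genotype matrix and all function names are hypothetical, and the model-based clustering performed with mclust is not reproduced here.

```python
import math


def normalize_genotypes(genotypes):
    """Center each SNP column and scale it by sqrt(p * (1 - p)),
    where p is the column mean divided by 2 (assumed normalization)."""
    n, m = len(genotypes), len(genotypes[0])
    normalized = [[0.0] * m for _ in range(n)]
    for j in range(m):
        mean = sum(row[j] for row in genotypes) / n
        p = mean / 2.0
        scale = math.sqrt(p * (1.0 - p))
        if scale == 0.0:
            continue  # monomorphic SNP: uninformative, left as zeros
        for i in range(n):
            normalized[i][j] = (genotypes[i][j] - mean) / scale
    return normalized


def top_principal_component(normalized, iterations=200):
    """Leading eigenvector of the sample-by-sample matrix X X^T,
    found by power iteration; one coordinate per individual."""
    n = len(normalized)
    gram = [[sum(a * b for a, b in zip(normalized[i], normalized[k]))
             for k in range(n)] for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # arbitrary nonzero start
    for _ in range(iterations):
        nxt = [sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical toy data: rows are individuals, columns are SNPs.
# The first three rows mimic one ancestry group, the last three another.
toy_genotypes = [
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [2, 2, 1, 2],
    [2, 1, 2, 2],
    [1, 2, 2, 1],
]

pc1 = top_principal_component(normalize_genotypes(toy_genotypes))
for label, score in zip("AAABBB", pc1):
    print(label, round(score, 3))
```

In this toy example the two groups fall on opposite sides of zero along the first component. In the analysis described above, several components were needed to separate five ancestry groups, and cluster membership was then assigned by a Gaussian mixture model rather than by the sign of a single component.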

Statistical Methods 

Administratively-assigned race was compared to cluster-based
ancestry assignment (Table 2) through contingency table analysis
using STATA 12. Additionally, comparisons for HapMap cluster
assignment are shown in Table S2. Agreement between these two
classification methods was measured by Cohen's Kappa coefficient 
[22], which takes into account the expected agreement of two 
'raters' based on the distribution of categories within the dataset. 
In this context, administrative assignment is the first 'rater' and 
genetically determined ancestry is the second 'rater'. Kappa is
standardized on a scale from -1 to 1, where 1 indicates perfect
agreement, 0 indicates agreement that would be expected by
chance, and negative values indicate less agreement than would be
expected by chance. Genetic ancestry categories are mutually
exclusive, so an individual can only be assigned to one category, 
based on clustering from principal components analysis. 
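
As a worked illustration of the Kappa coefficient, the sketch below computes kappa = (p_o - p_e) / (1 - p_e) from a square contingency table, where p_o is the observed proportion of agreement and p_e is the agreement expected from the row and column totals. The counts are those reported in Table 2. The row-to-column pairing is an assumption based on the notes to Table 3: "Unknown" is excluded, and the Native American and Indian rows are summed and paired with the South Asian cluster. The analysis itself was run in STATA, so this is a reconstruction under those assumptions, not the original procedure.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square contingency table
    (rows: first rater, columns: second rater)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = sum(table[i][i] for i in range(k)) / n
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1 - expected)


# Columns: European, African, East Asian, Hispanic, South Asian clusters.
counts = [
    [4174, 24, 8, 16, 10],      # Caucasian
    [11, 1080, 0, 3, 0],        # African American
    [9, 0, 182, 2, 35],         # Asian/Pacific
    [58, 8, 2, 154, 8],         # Hispanic
    [90 + 3, 17 + 2, 18 + 0, 18 + 0, 41 + 2],  # Native American + Indian (assumed pairing)
]

kappa = cohens_kappa(counts)
print(round(kappa, 3))
```

Under these assumptions the result rounds to 0.872, in line with the overall value reported in Table 3.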

Results 

Table 1. Distribution of administratively-assigned race.

Race               Study Sample     BioVU              Synthetic Derivative   Davidson Co.*
Caucasian          4,232 (58.4%)    102,018 (64.4%)    1,116,837 (51.6%)      385,039 (61.4%)
African American   1,094 (15.1%)    14,223 (9.0%)      191,246 (8.8%)         173,730 (27.7%)
Asian/Pacific      228 (3.1%)       1,380 (0.9%)       14,449 (0.7%)          15,083 (2.4%)
Hispanic           230 (3.2%)       2,147 (1.3%)       37,466 (1.7%)          **
Native American    184 (2.5%)       212 (0.1%)         1,868 (0.1%)           2,091 (0.3%)
Indian             7 (0.1%)         1,711 (1.1%)       20,613 (1.0%)          4,338 (0.7%)
Unknown            1,277 (17.6%)    36,696 (23.2%)     781,074 (36.1%)        46,400 (7.5%)
Total              7,252 (100%)     158,387 (100%)     2,163,553 (100%)       626,681 (100%)

Race categories listed are based on classification options originating from the SD. Our BioVU dataset contained no individuals labeled Other (O). Vanderbilt University
Medical Center is located in Davidson County, TN. 2010 US census data is shown for Davidson County, Tennessee [25]. * For Davidson County, "Asian/Pacific" includes
Asian (Non-Indian), Native Hawaiian, and Pacific Islander individuals, "Native American" includes Native American (American Indian) and Alaskan Native individuals,
"Indian" includes Asian Indian individuals, and "Unknown" includes 'some other race' and individuals who reported two or more races for the census. ** "Hispanic" is
not listed as a race in the US Census; rather, Hispanic-origin is indicated and is not exclusive to any racial category. For example, 25,156 individuals in Davidson County who
self-identified as 'White' also self-identified, separately, as Hispanic. Within Davidson County, 9.8% of individuals indicated Hispanic origin.
doi:10.1371/journal.pone.0099161.t001

The distribution of administratively-assigned race across the
sample used in this study, within BioVU, and within the entire
synthetic derivative (SD), as well as population-level counts for
Davidson County, Tennessee, is shown in Table 1. Plotting PC
1 versus PC 2 (Figure 1A) shows differentiation between
Caucasian, African American, and Asian/Pacific assigned
individuals, with Hispanic, Native American, and Indian assigned
individuals falling between the three foci. Results from the
model-based clustering are shown in Figure 1B. The European-
descent, African-descent, and East Asian-descent clusters are
distinct. The South Asian-descent and Hispanic-descent clusters
are less defined, due to their varying degrees of admixture. Our
ability to make inferences about the accuracy of Native American 
and Indian codes is limited due to ambiguous use of these codes in 
clinical practice, limited availability of Native American HapMap 
reference populations, and small sample size within our dataset. 
Kappa (K) measures of agreement between third-party race 
assignment and estimated genetic ancestry are shown in Table 3 
(more detailed information on Kappa statistics shown in Table 
S3). Over the entire dataset, agreement was reasonably high 
(K = 0.872), largely driven by European-descent (K = 0.906) and 
African-descent (K = 0.964) individuals. Less agreement was seen
for East Asian-descent (K = 0.825) and Hispanic-descent
(K = 0.718) individuals. We also assessed agreement between 
individuals with Native American (N) and Indian (I) racial codes 
and South Asian ancestry estimated by the Gujarati Indian 
reference samples (GIH) to examine the hypothesis that these 
codes predominantly represent South Asian ancestry. This
agreement (K = 0.284) was low, as expected, indicating that
although these codes may be misapplied in the clinical
environment, they are not used predominantly for individuals of
South Asian ancestry. Notably, when
stratifying by sex, we observe similar Kappa agreement values 
for European and African-descent genetic ancestry groups. In 
other groups, females tend to have slightly higher Kappa values 
than males, with the largest difference in agreement by sex 
observed for individuals in the South Asian-descent genetic cluster. 
Table 2. Percentages of each administratively-assigned race assigned to each genetic ancestry group.

                                  Genetic Ancestry
Administratively-Assigned Race    European         African          East Asian     Hispanic       South Asian
Caucasian                         4,174 (98.6%)    24 (0.6%)        8 (0.2%)       16 (0.4%)      10 (0.2%)
African American                  11 (1.0%)        1,080 (98.7%)    0 (0.0%)       3 (0.3%)       0 (0.0%)
Asian/Pacific                     9 (3.9%)         0 (0.0%)         182 (79.8%)    2 (0.9%)       35 (15.4%)
Hispanic                          58 (25.2%)       8 (3.5%)         2 (0.8%)       154 (67.0%)    8 (3.5%)
Native American                   90 (48.9%)       17 (9.2%)        18 (9.8%)      18 (9.8%)      41 (22.3%)
Indian                            3 (42.8%)        2 (28.6%)        0 (0.0%)       0 (0.0%)       2 (28.6%)
Unknown                           1,126 (88.3%)    83 (6.5%)        26 (2.0%)      21 (1.6%)      21 (1.6%)

Percentages reflect the proportion of individuals assigned to a genetic ancestry cluster for given administratively-assigned race.
doi:10.1371/journal.pone.0099161.t002

In addition to using Kappa statistics to measure agreement,
agreement can be visualized as the percent of individuals with a
given administratively-assigned race assigned to each of the five 
genetic ancestry clustering groups (Table 2). We also examined the 
genetic ancestry of individuals with race status "unknown" to 
determine if some groups were more likely to be assigned this status 
than others (Table S4). The majority (88.2%) of samples with 
"unknown" race are genetically of European-descent, consistent 
with the overall representation of European-descent individuals in 
BioVU. African-descent individuals constitute 6.5% of the
"unknown" individuals, while East Asian-descent, South Asian-descent,
and Hispanic-descent individuals each constitute about 2%.
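
The row percentages in Table 2 are simple proportions of each administratively-assigned group falling into each genetic cluster. As a minimal sketch, the example below recomputes them for the Asian/Pacific row, using the counts from Table 2. The function name `row_percentages` is illustrative. If genetic ancestry is treated as the reference standard, the percentage for the matching cluster plays a role similar to a positive predictive value for that race code.

```python
def row_percentages(counts):
    """Percentage of a row's individuals in each column, to one decimal place."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


clusters = ["European", "African", "East Asian", "Hispanic", "South Asian"]
asian_pacific_counts = [9, 0, 182, 2, 35]   # Asian/Pacific row of Table 2

asian_pacific_pct = row_percentages(asian_pacific_counts)
for cluster, pct in zip(clusters, asian_pacific_pct):
    print(f"{cluster}: {pct}%")
```

For this row, 182 of 228 individuals (79.8%) fall in the East Asian-descent cluster, and most of the remainder (35 individuals, 15.4%) fall in the South Asian-descent cluster.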

Discussion 

Genetic and epidemiological studies routinely use self-reported 
race or genetic ancestry to adjust for confounding factors and/or to
tailor genetic effects to specific population subgroups. Global 
genetic ancestry is often used to correct for population stratification 
in genetic analyses, because it roughly reflects differences in allele 
frequencies between continental populations. The social construct 
of race is often used to capture other demographic factors, such as 
access to care, dietary and environmental exposures, and socioeco- 
nomic status. Self-reported race has been shown to be highly 
correlated to genetic ancestry and is often used as a surrogate for 
continental ancestry. In many clinical datasets, self-reported 
ancestry is not available and various administrative procedures 
are used to assign race status. While it is unknown to what degree 
administratively-assigned race captures the various social and 
cultural aspects of an individual, in this work we show that it has 
only moderate agreement with genetic ancestry for certain 
populations. We observed strong agreement between administrative 
race assignment and genetically determined ancestry for European- 
descent and African-descent individuals; there was less agreement
between assigned race and genetic ancestry for East Asian-descent,
South Asian-descent, and Hispanic-descent individuals. Given this 
fact, investigators should use caution when using administratively- 
assigned race as a proxy for genetic ancestry, and expect some 
misappropriation of racial categories by third party assignment. 

Interestingly, East Asian-descent, South Asian-descent, and 
Hispanic-descent individuals all have slightly different agreement 
statistics by sex, with females tending to have slightly higher 
agreement between administrative assignment and genetic ances- 
try. Previous studies have reported subjective misclassification of 
Hispanic individuals by sex, causing non-Hispanic females to be 
classified as Hispanic because of adopted spousal surnames [23]. 
In our data the agreement is biased slightly in the opposite
direction, with females having more accurate administratively-
assigned race, based on genetic ancestry estimates. While
somewhat unexpected, this could be because third-party assigners 
are more comfortable asking females, rather than males, questions 
about their race and ethnicity [24]. 

Approximately 18% of the individuals in our dataset had an 
administratively-assigned race specified as "unknown" (Table 1). 
The distribution of genetic ancestries within these samples was
significantly different from the larger dataset, with more
European-descent individuals than expected (results not shown). As a
result, "unknown" race in BioVU should not be used as an 
indicator of minority population status — it is far more likely that 
individuals with "unknown" race are of European-descent. 

In conclusion, administratively assigned race is an accurate 
predictor of genetic ancestry for the ascertainment of European- 
descent and African-descent individuals, but is less accurate for 
other diverse populations. Investigators accessing Asian-descent or 
Hispanic-descent populations should expect a moderate number of 
samples to have administrative race labels inconsistent with genetic
ancestry. When race is an important factor in a study, we
recommend, when possible, that a low-cost genotyping array, such
as a fixed-content Illumina BeadChip (e.g., the Illumina HumanCore
array), be used to genotype ancestry-informative markers (AIMs) to
determine genetic ancestry.

Figure 1. Comparison of administratively-assigned race and genetic ancestry, based on principal component analysis. A) All pairwise
combinations of principal components (PCs) 1 through 3, by administratively-assigned race. B) All pairwise combinations of PCs 1 through 3, by
cluster assignments corresponding to genetic ancestry. Comparison of Frames 1A and 1B indicates individuals whose administratively-assigned race
differs from their genetically defined ancestry cluster. For example, the East Asian-descent cluster (1B; blue) contains individuals with
administratively-assigned race (1A) of Caucasian (green), Hispanic (purple), and Other (orange).
doi:10.1371/journal.pone.0099161.g001

Table 3. Agreement between genetic and assigned ancestry.

Genetic Ancestry       Overall          Male             Female
Overall                0.872 (0.009)    0.862 (0.015)    0.876 (0.012)
European-descent       0.906 (0.013)    0.906 (0.020)    0.904 (0.017)
African-descent        0.964 (0.013)    0.970 (0.020)    0.960 (0.017)
East Asian-descent     0.825 (0.013)    0.800 (0.020)    0.836 (0.017)
Hispanic-descent       0.718 (0.013)    0.683 (0.020)    0.738 (0.017)
South Asian-descent    0.284 (0.012)    0.237 (0.018)    0.318 (0.016)

Notation: Cohen's Kappa coefficient (standard error).
South Asian-descent includes individuals with Native American and Indian race codes in BioVU.
Samples with administratively-assigned race of "Unknown" were excluded from this analysis.
doi:10.1371/journal.pone.0099161.t003

Supporting Information 

Table S1 Race/ethnicity terminology usage.

(DOC) 

Table S2 Percentages of each administratively-assigned